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Abstract 

In this paper we consider the problem of Gaussian process classifier (GPC) model selection 
with different Leave-One-Out (LOO) Cross Validation (CV) based optimization criteria 
and provide a practical algorithm using LOO predictive distributions with such criteria to 
select hyperparameters. Apart from the standard average negative logarithm of predictive 
probability (NLP), we also consider smoothed versions of criteria such as F-measure and 
Weighted Error Rate (WER), which are useful for handling imbalanced data. Unlike the 
regression case, LOO predictive distributions for the classifier case are intractable. We 
use approximate LOO predictive distributions arrived from Expectation Propagation (EP) 
approximation. We conduct experiments on several real world benchmark datasets. When 
the NLP criterion is used for optimizing the hyperparameters, the predictive approaches 
show better or comparable NLP generalization performance with existing GPC approaches. 
On the other hand, when the F-measure criterion is used, the F-measure generalization 
performance improves significantly on several datasets. Overall, the EP-based predictive 
algorithm comes out as an excellent choice for GP classifier model selection with different 
optimization criteria. 

Keywords: Gaussian process classification, Model Selection, LOO, Cross Validation, 
Predictive distributions, Smoothed F-measure, Weighted Error Rate, Precision, Recall, 
Imbalanced data 



1. Introduction 

Gaussian process (GP) models are flexible and powerful probabil istic models that are used 
to so lve classification problems in many areas of application ( Rasmussen and Williams] . 
200fih . In the Bayesian GP setup, latent function values and hyperparameters that are 



involved in modeling are integrated with chosen priors. However, the required integrals 
are often not analytically tractable (due to various choices of likelihoods and priors) and 
closed form analytic expressions are not available. Two important problems in this context 
are finding good approximations for integrating over the latent function variables and the 
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hyperparameters. There have been two approaches reported in the literature. In the first 
approach, both the latent fu nction variables and t h e hyp erparameters are integrated out 
within some approximations. IWilliams and Barber (|l998l ) used Laplace approximation to 
integrate over the latent fu nction varia bles and Hybrid Monte Carlo (HMC) to integrate 
over the hyperparameters. Neall ( 19981 ) used Gibbs sampling to integrate over the latent 
function variables and HMC to integrate over hyperparameters; this method is more accu- 
rate, but it is computationally expensive. In the second approach, only the latent function 
variables are integrated out and the hyperparameters are optimized on some well-defined 
objective function. This latter problem of choosing hyperparameters that define the model 
is essentially the Gaussian process model selection problem and in this paper we focus on 
this problem. 

There are two commonly used approaches to address this model selection problem. They 
are marginal likelihood or evidence maximization and minimization of LOO-CV based av- 
erage negative logarithmic predicti ve probability (NLP). Both these approaches are available 
for G P regression model selection (Rasmussen and Williams 



2006: 



Sundararaian and Keerthi . 



200lh . For GP classifier model selection. iGibbs and MacKayl (j2000l ) used a variational ap 
proximation method to integrate over the latent function values and estimated the hyperpa- 
rameters by maximizing marginal li kelihood (ML). Laplace approximation and E xpectation 
Propagation (EP) approximation ( Rasmussen and Williams! 20061 : Seeger . 20031 ) are other 
methods that have been used to integrate over the latent function variables. Then, the 
marginal likelih ood is optimized using gradient i nformation o b tained with any one of t hese 
approximations ( Rasmussen and Williams] . 2006 : Seegerl . 2003 ). Kim and Ghahramani ( 20061 ) 
presented an approximate Expectation-Maximization (EM) algorithm to learn the hyper- 
parameters. In the E-step, they used EP to estimate the joint density of latent function 
values and in the M-step, the hyperparameters were optimized by maximizing a variational 
lower bound on the marginal likelihood. 

In this paper, we consider the approach of using LOO-CV based pr edictive distributions 
to add ress the GP classifier model selection problem. In a related work, lOpper and Winther 
(|2000h used LOO error estimate t o choose roug h hyperparameter values by scanning a range 
of values. In the EP framework ( Minkal . 1200 ll ). cavity distributions directly provide LOO 
error estimates during training and were used to assess predictiv e performance a nd select au- 



tomatic relevance determination (ARD) type hyperparameters ( Qi et al. . 20041 ) . The LOO 



CV based predictive distributions obtained from probabilistic l east squares classifier were 
used in the minimization of NLP for GP classifier model selection ( Rasmussen and Williams] . 
20061 ). 



In practice, while measures like marginal likelihood and average negativ e logarithmic 
predi ctive probability measures are very useful, other measures like F-measure (Ivan Rijsbergenl . 



19741 ) and Weighted Error Rate (WER) are also important and useful, for instance in ap- 



plications like medical diagnostics, image understanding, etc, where the number of positive 
examples is much smaller than the number of negative examples. Several wo rks that u se 



such m easures for hyperparameters optimization exist in the non-GP literature. iHong et al 



(|2007h proposed a kernel classifier construction algorithm based on regularized orthogonal 
weighted least squares (ROWLS) estimation with LOO- Area Und er the ROC Cu rve (AUC) 
as model selection criterion for handling imbalanced datasets. Jansche ( 20051 ) proposed 
the training of a probabilistic classifier based on a logistic regression model by optimizing 
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expected F-measure. Here an approximation to F-measure was ma de so that the F-measure 
is smoothed and becomes a smooth function of model weights. Seeger (|2007h proposed 
a general framework for learning in probabilistic kernel classification models with a large 
or structured set of classes . He o ptimized the kernel parameters by minimizing the NLP 
over k-folds. iKeerthi et all (j200fih considered the task of tuning hyperparameters in SVM 
models based on minimizing smooth performance validation functions like smoothed k-fold 
CV error. The last three works did not use LOO-CV in their work. 

This paper is aimed at addressing two i ssues. First, the prop osed method is different 



l ms paper is aimed at addressing two i ssues. Jirst, tne prop osed metnod is dirierent 
from the work of lOpper and Winther (2000) and bi et all (|2004r i in that we optimize the 
hyperparameters directly in the continuous space and any standard non-linear optimization 
method can be used. It is also d ifferent from the LOO-CV based probabilistic LS method 
( Ras mussen and Williamsl . lioOrih in that we use the more accurate EP approximation than 
the LS approximation. Second, criteria such as F-measure and WER, which are needed for 
tackling imbalanced problems, have not been considered in GP classifier designs. 

We define smoothed LOO-CV based measures using predictive distributions as a func- 
tion of GP classifier model hyperparameters. Thus, the objective functions can be opti- 
mized using standard non-linear optimization techniques. We investigate usage of LOO- 
CV predictive distributions obtained from expectation propagation approximation. Ac- 
tually, the proposed algorit hm can also be used with Laplace approximation. However, 
Kuss and Rasmussen! ()2005l ) showed for binary classification problems that the EP approx- 
imation is better than the Laplace approximation. Therefore, we restrict our attention to 
using the EP approximation here. 

We conduct experiments on two criteria: the standard average negative logarithm of pre- 
dictive probability (NLP) and a smoothed version of F-measure. On the NLP criterion we 
compare our m ethod (EP-CV(NLP)) aga i nst th e LOO-CV based probabilistic least squares 
(LS) classifier (jRasmussen and Williamsl . lioOfih and standard GP classifier doing ML max- 
imization using EP approximation. We refer the latter two methods as LS-CV(NLP) and 
EP(ML) respectively; the abbreviations in parentheses refer to the type of objective func- 
tions used. The experimental results on several real world benchmark datasets show that 
the proposed method is better than the LS-CV(NLP) method and is quite competitive to 
the EP-ML method. On the F-measure criterion we compare the EP-CV(NLP) method 
with the EP-CV(FM) method. In the latter method we optimize over the F-measure in- 
stead of the NLP measure. We also compare with a two-step methocO] where, in the first 
step we optimize over the hyperparameters using the NLP measure and, in the second 
step we optimize only the bias hyperparameter using the F-measure. Experimental results 
demonstrate that this method is also inferior to the EP-CV(FM) method. 

The paper is organized as follows. We give a brief introduction to Gaussian process 
classification and ML optimization criteria in Section 2. In Section 3 we give a general 
set of smooth LOO-CV based optimization criteria and illustrate their use with LOO-CV 
predictive distributions. Optimization aspects and specific algorithm are given in Section 
4. In Section 5 we discuss related work on LOO-CV based GPC model selection. In Section 
6 we present experimental results and then conclude the paper in Section 7. 



1. This method was suggested by an anonymous reviewer. 
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2. Gaussian Process Classification 

In binary classification problems, we are given a training data set S composed of n input- 
target pairs (Xj, j/j) where Xj € R D , yi G {+1,-1}, % G I and I = {1, 2, . . . , n}. The 
true function value at Xj is represented as a latent variable /(xj). The goal is to compute 
the predictive distribution of the class l abel y* at test location x*. In standard GPs for 
classification ( Rasmussen and Williams] . 20061 ) . the latent variables /(xj) are modelled as 



random variables in a zero mean GP indexed by {xj}. The prior distribution of {f(X n )} 
is a zero mean multivariate joint Gaussian, denoted as p(f) = A/"(0, K), where f = 
[/(xi), . . . , /(x n )] T , X n = [xi, . . . ,x n ] and K is the n x n covariance matrix whose (i,j) th 
element is k(xi,Xj), denoted as Kjj. One of the most commonly used covariance functions 
is the squared exponential covariance function given by: cov(/(xj), /(xj)) = fc(xj,Xj) = 

/3q exp(— | Ylk=i „ Xj ' k ^ ). Here, /3q represents signal variance and the remaining /3^'s rep- 
resent width parameters across different input dimensions. Let (3 == \fio, /3\, . . . , /?£>]. These 
parameters are also known as automatic relevance determination (ARD) hyperparameters. 
We call this covariance function the ARD Gaussian kernel function. Next, it is assumed 
that the probability over class labels as a function of x depends only on the latent function 
value /(x). For the binary classification problem, given the value of /(x) the probability 
of class label is independent of all other quantities: p{y = +l|/(x), S) = p{y = +l|/(x)) 
where S is the dataset. The likelihood p{yi\fi) can be modelled in several forms such as 
a sigmoidal function or cumulative normal <&(yifi) where &(z) = f*^ exp(— ^-)dw. 



Assuming that the examples are i.i.d, we have p(y[f ) = YliLi P{Vi\fu l)- Here, 7 represents 
hyperparameters that characterize the likelihood. The prior and likelihood along with the 
hyperparameters = [/3, 7] characterize the GP model. With these modelling assumptions, 
we can write the inference probability given as: 

p(y*|x*,S,0) = J p(y*\f*,j) p(f*\S,**,9) df* (1) 

Here, the posterior predictive distribution of latent function is given by: 

p(/*|S,x*,0) = J p(f*\x*,f,(3)p(f\S,e)df. 

where p(f\S,6) oc YliLi P(Ui\fii 7) f(f |X, (3). In a Bayesian solution, the class probability 
at the test point would be obtained by integrating over the hyperparameters weighted 
by their posterior probability 



p(y*\x*,S) = J p(y*\x*,S,9) p(0\S) d6. 



In general there is no closed form expression available for this integral and it is expensive to 
compute. Therefore, instead of integrating over the hyperparameters, a single set of their 
values is usually estimated from the dataset by optimizing various criteria as mentioned 
earlier and then used in ([!]). 
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2.1 Marginal Likelihood Maximization 

Marginal likelihood or evidence maximization ( Rasmussen and Williams . 20061 ) is commonly 



used to estimate the hyperparameters during model selection. The marginal likelihood is 
given by: 

N 

p(y|X,0) = / Hp( yi \f in )p(f\X,P)df (2) 

i=i 

This integral cannot be calculated analytically except for a special case like GP regression 
with Gaussian noise. Therefore, certain approximations are needed to compute these quan- 
tities. Laplace approximation and EP approximations are two popular methods used for 
this purpose. To gain more insight into the quality of the Laplace and EP approximations, 
Kuss and Rasmussen (j2005l ) carried out comprehensive comparisons of these approxima- 



tions with (the more exact) Markov Chain Monte Carlo (MCMC) sampling approach in 
terms of their predictive performance and marginal likelihood estimates. They found that 
EP is the method to be used for approximate inference in binary GPC models, when the 
computational cost of MCMC is prohibitive. Hence in our study we restrict ourselves to the 
more accurate EP approximation given in the next subsection. With the EP approximation 
the hyperparameters are learn t by optimizing ma rginal likelihood with gradient informa- 
tion ( Rasmussen and Williams! . 2006 : Seeger . 2003) using standard non-linear optimization 



techniques. 

2.2 Expectation Propagation 

The EP algorithm is an iterative a lgorithm whi ch is used for approximate Bayesian in- 



feren ce (iQpper and W inthcr , 2000: Minka, 2001). It has been applied to GP classifica- 
tion ( Rasmussen and WilliamsT 20061 : ISeegerl . 120031 ) . EP finds a Gaussian approximation 



q(f\S,6) = A/"(f|m, C) to the posterior p(f\S,6) by moment matching of approximate 
marginal distribution and the posterior. Mean and covariance of the approximate Gaussian 
are given by: 

m = CirV C = (KT 1 + 5T 1 )" 1 (3) 

where /x = (fj,\, fi2, ■ ■ ■ , ^ n ) T and S = diag (af, a^, <J^) are called site function pa- 
rameters. As per the approximation, the posterior is written in terms of the site functions 
t(fi]/jbi,af,Zi) = ZiN(fi\fj,i,crf) and prior p(f|X,0) as 

N 



The EP algorithm iteratively visits each site function in turn, and adjusts the site parameters 
to match moments of an approximation to the posterior marginals. This process requires 
replacement of intractable exact cavity distribution with a tractable approximation based 
on the site functions and is given by: 



M/0 = JilKfr^j^l^pW*' 9 )^ 

3& 
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where q\i(fi) oc A/"(/i|jU\j, cr?J is the approximate cavity function and is related to the 
diagonal entries of the posterior q({\S,6) as: q\i(fi)t(fi; fa, of , Zi) oc N(fi\mi,Cu). Here, 
the Ca represent the diagonal entries of the matrix C. Using Gaussian identities, the mean 
and variance of cavity distribution is related to the site parameters as: 

= 4^ " 4) ^ = ((C*)" 1 - ofT 1 (4) 

Then, EP adjusts the site parameters of and Zj such that the approximate posterior 
marginal using the exact likelihood approximates the posterior marginal based on the site 
function well. That is, q\i(fi)p(yi\fi) ^ q\i(fi) t{fi, fa, of, Zi). This is done by matching the 
zeroth, first and second moments on both sides. Thus, the EP algorithm iteratively updates 
site parameters until convergence. Tho ugh there is no convergence proof , in practice the EP 
algorithm converges in most cases. See iRasmussen and Williams! l|2006l ) for more details. 



Next, within some constant, the marginal likelihood with EP approximation (IRasmussen and Williams 

2006h is given by: 



11 

logg(y|X,0) = --log|K + E| - -/x T (K + S)" V + 2>gx«i (5) 

i=i 

where Wi = $(zi) exp ( ^\ ^ 1 \ , \ . I o? i + of and z% = ■ The hyperparameters are 

optimized using gradient expressions with standard conjugate gradient or quasi-Newton 
type non-linear optimization techniques. 



3. Leave-One-Out Cross Validation based Optimization Criteria 

In this section, we give definitions of various LOO-CV based optimization criteria. In 
section 4 we give details on how these measures can be optimized using standard nonlinear 
optimization techniques. The LOO predictive distributions p(?/j|xj, Su, 0), i € I play a 
crucial role. Here Sw represents the dataset without ith example. Their exact computation 
is expensive. In section 4 we will also discuss how to approximate them efficiently. 



3.1 NLP Measure 

The averaged negative logarithm of predictive probability (NLP) is defined as: 



1 n 



(6) 



i=l 



This LOO-CV based measure is generic and has been used in the context of probabilistic LS 

class ifiers (see section 5) and GP regression ( Rasmussen and Williams] . 2006 : Sundararaian and Keerthi 
l2Q0lh . 



3.2 Smoothed LOO Measures 

While measures such as marginal likelihood ([5]) and NLP in © are useful for normal 
situations, other measures like F-measure and WER are important, for example, when 
dealing with imbalanced datasets. Let us now define these measures. 
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Table 1: Confusion Matrix for Binary classification 





Positive (Predicted) 


Negative (Predicted) 


Positive (Actual) 


a 


b 


Negative (Actual) 


c 


d 



Consider the binary classification problem with class labels +1 and -1. Assume that 
there are n+ positive examples and n_ negative examples. In general, the performance 
of the classifier may be evaluated using counts of data samples {a, b, c, d} defined via the 
confusion matrix given in Table 1. Let n + = a + b and n_ = c + d. The true positive 
rate (TP) is the proportion of positive data samples that were correctly classified (that is, 
true positives) and the false positive rate (FP) is the p roportion of nega tive data samples 



that were incorrectly classified (that is, false positives) ( Hong et al. . 20071 ), These rates are 



given by: TP = -tt = — and FP = — = — . The misclassification rate is given 

° J a+b ra+ c+d ri— ° 

by: MCR = ^j-p. Note that the true positive rate is also known as Recall (R). Precision is 
another important quantity defined as: P = 

Now let us consider the imbalanced data case and assume that n_ n + . In this case 
if MCR is minimized then the classifier will be biased toward the negative class due to 
its effort in minimizing the false positives (that is c) more strongly than minimizing false 
negatives (that is b). In the worst case almost all the positive examples will be wrongly 
classified, that is a — > . This results in both P —¥ and R — > 0. Thus MCR is not a 
good measure to use when the dataset is imbalanced. This problem can be addressed by 
optimizing other measures that we discuss next. 



The F-measure is one such measure and is defined (jvan Riisbergenl . Il974l ) as: 



F ( (P.R) 



R P 



where < £ < 1 and < Ff(P,R) < 1 . It has been used in v arious applicatio ns like 
document retrieval (jvan Riisbergenl . Il974l ) and text classification (j Joach ims. 2005). It is 



particularly p referable over MCR when the dataset is highly imbalanced (jJoachimsl . 120051 : 
Janschel . [2005l ). 

Let us get into more details on the functioning of the F-measure. When £ — > 0, we get 
F^(P, R) — > P- Then optimizing the F-measure means we are interested only in maximizing 
the Precision. On the other hand, when C — > 1, we get F^(P,R) — > R. In this case we 
are interested only in maximizing the Recall. The user can choose an appropriate value for 
£ depending on how much of importance he/she wants to give to the precision and recall. 
Thus, the F-measure combines precision and recall into a single optimization criterion by 
taking their ("-weighted harmonic mean. In the imbalanced data case mentioned above 
MCR minimization can potentially result in F^(P,R) — > 0. By maximizing the F-measure 
we can prevent the classifier from being completely biased towards the negative class. Note 
that Fr(P, R) can be re- written in terms of a, b and c as: 



F ? (a,6, c) 



a + (b+(l-()c 



(7) 
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0.5. 



In this case, it becomes Fq^(P,R) 
Then, maximizing ^0.5(0, b, c) 



b+c 
2a 



- T+R an<1 

is equivalent 



In all our experiments we set £ = 
can also be written as: Fo^(a,b,c) 

* ' 2a 

to maximizing Thus, we can maximize Fo^(a, b, c) by both minimizing the error (that 
is, b + c) and maximizing the true positives. The trade-off kicks-in since maximizing the 
true positives tends to increase the false positives. Thus maximizing the F-measure controls 
both the true positives and the error appropriately. In general the F-measure summarizes 
a classifier's ability to identify the positive class and plays an important role in the evalu- 
ation of binary classifier. As a criterion for optimizing hyperparameters F-measure can be 
computed on an evaluation or validation dataset. However, in practical situations involving 
small datasets^] it is wasteful to employ a separate evaluation set. The LOO-CV approach 
would be useful in such situations and we show next how the F-measure can be estimated 
wit h such approach. 



Hong et all (|2007h estimated TP and FP as: 



I n 

TP = — y2T(y\iyi,yi), 
+ i=i 

1 n 

FP = - Vf(p,K). 
i=l 

Here y\s represents predicted label for ith sample. Therefore ywyi takes value +1 when the 
prediction matches with the actual label and —1 otherwise. T(u,v) is an indicator function 
which is 1 if u = 1 and v = 1. Similarly, F(u,v) is one if u = — 1 and v = —1. Otherwise, 
these functions take zero values. Hong et aD ( 2007 ) used these estimates to compute AU C = 
1 + TP ^ FP as an approximation of AUC and used this criterion to select a subset of basis 
vectors in a kernel classifier model construction procedure for imbalanced datasets. Note 
that this definition of AUC is app licable only for a hard classifier (fixed non-probabilistic 
classifier) with binary outputs. See Hong et al. ( 20071 ) for more details. In a strict sense such 
a definition of AUC is not suitable for a probabilistic classifier like GP classifier that provides 
continuous probabilistic output. However, we can make use of this approach of defining TP 
and FP as above to compute the quantities a, b and c that are needed to evaluate the F- 
measure in ([7]). There are two issues associated with these estimates. The first issue is that 
these estimates are not smooth (in fact, not even continuous) functions of hyperparameters. 
Therefore they cannot be used directly in any approach that uses gradient-based nonlinear 
optimization methods to tune the hyperparameters. Secondly, these estimates do not use 
predictive probability values which is p articularly imp ortan t when we wan t to ta ke variance 
also into account. In non- GP contexts Ijanschel (|2005h and iKeerthi et aD (|2006l ) addressed 
the first issue by defining smoothed F-measure or other validation functions by replacing 
the indicator function with a sigmoid function, which makes the optimization criterion as a 
smooth function of hyperpar ameters. Howev er, they did not consider a LOO approach and 
use d a validation set ins tead. Jansche ( 20051 ) considered maximum a posterior probabilities 
and lKeerthi et al.1 (|2006h used sigmoidal approximations for SVM models. Here, we propose 
to combine LOO based estimation and smoothed version of the quantities {a, b, c, d} denoted 



2. GP models are known to be particularly valuable for problems with small datasets. 
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as A(0), B(0), C{6) and D{0). We can set 

A{6) = £ p{y i = +l\x i ,S v ,e) (8) 

Since n + = a + b, we can write B(6) = n + — A(6). With m + denoting the number of 
examples predicted as positive, we can parameterize it as m + {6) = A(6) + C(6). This can 
be rewritten as: 

n 

m +( 9 ) = $>(2/i = +lKS v > ) ( 9 ) 

i=l 

Thus, the smoothed F-measure can be defined from ([7]) as: 

= , T ,f ( °L 7aY (10) 

Cn+ + (1 - C)ra+(0) 

Note that can be defined in a similar fashion as m_(0) = + D{0). Using these 

quantities, other derived quantities like TP (6) and FP(ff) can be defined as LOO based 
estimates. Then, smoothed LOO estimates of WER can be obtained as shown below. 

The WER measure is another useful measure for unbalanced datasets. Using the quan- 
tities defined above, its smoothed version can be written as: 

, . n+(l -TP(0)) +rn_FP(6>) , . 

WER(0;t) = — — . 11 

where r is the ratio of the cost of mis-classifications of the negative class to that of the 
positive class and < r < 1. Thus by choosing a suitable r value for a given problem 
and optimizing over the hyperparameters we can design classifiers without becoming biased 
toward one class. Not e that for ease of n otation we have omitted hat on TP(-) and FP(-). 



Following the work of iHong et al.l (|2007l ) one can also define 



jurat) = i+imzim (12) 

and optimize over the hyperparameters. As mentioned earlier, such a definition is not 
suitable for the GP classifier. Nevertheless it is interesting to note that it has the desirable 
property of trading-off between high TP and low FP. Also, on comparing this definition of 
AUC with (jlip we see that they are related in the sense that maximizing AUC is same as 
minimizing WER when r = 1 and n + = n_ . 

Overall we see that the LOO-CV predictive distributions can be used to define vari- 
ous criteria that are smooth functions of hyperparameters resulting in smoothed LOO-CV 
measures. Now given that the LOO-CV predictive distributions are readily available from 
the EP algorithm, we can optimize the various smoothed LOO-CV measures directly using 
standard non-linear optimization techniques. 

4. EP-CV Algorithm for Choosing Hyperparameters 

Various criteria such as ©, ()10p . (jTTJ) and (I12|) depend on the hyperparameters via the 
predictive distributions p(yi\xi, Sw, 6). With cumulative Gaussian likelihood, they can be 
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written as: 

p( yi \x u S v ,6) = $( ^ft±^ ) (13) 

Note that (fT3|) is obtained from ([1]) with p(fi\xi, S\i,0) = J\f(fi\i,a^ i ). The hyperparam- 
eter 7 is referred to as the bias parameter and it helps in shifting the decision bound- 
ary with the probability value h. In genera l, the bias hyperparameter 7 is very useful 
( Rasmussen and Williams] . 2006 : Seeger . 2003) and can be optimized. 



EP Approximation 

To compute (|13p we need the LOO mean fj,^ and variance <r? i Vz. With the EP approx- 
imation, they can be computed using (J3|). Full details of gradient calculations needed for 
implementing hyperparameter optimization are given in the appendix. 

We take the expectation-maximization (EM) type approach for hyperparameters op- 
timization. This is because gradient expressions involving implicit derivatives (with site 
parameters varying as a function of hyperparameters) are not available due to the iterative 
nature of the EP algorithm. This approach results in the following algorithm. 

EP-CV Algorithm: 

1. Initialize the hyperparameters 0. 

2. Perform E-Step: Given the hyperparameters, we find the site parameters fi and £ 
and the posterior q({\S,0) = M(m,C) using EP algorithm. 

3. Perform M-Step: Find the hyperparameters 9 by optimizing over any LOO-CV 
based measure like © , (fT0|) , (fTTj) or (fT2|) using any standard gradient based optimiza- 
tion technique. We carry out just one line search in this optimization process. During 
this line search as the hyperparameters change, we perform the following sequence of 
operations. 

(a) Compute the posterior mean m and covariance C using ([3]). 

(b) Compute the LOO mean /xy and variance cr^ using 

(c) Compute the chosen objective function like ©, (fT0|) . (fTT|) or (fT2j) and its deriva- 
tives. 

Note that through out this M-step, it is assumed that the site parameters are fixed 
and the values obtained from step (2) are used. 

4. repeat steps 2-3 until there is no significant change in the objective function value. 

This algorithm worke d well in our experiments. A similar EM approach was used by 
Kim and Ghahramani (j2006h (which they called EM-EP algorithm) in the optimization 



of a lower bound on the marginal likelihood. 

Since the EP-CV algorithm optimizes the smoothed F-measure it is useful to study the 
behavior of the true F-measure as optimization proceeds. We do this study on two of the 
datasets described in Table 6 of section 6. The optimization algorithm was terminated when 
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Figure 1: F- Measure Optimization on Car3 and Yeast7 Datasets 



there was no significant change in the smoothed F-measure value. From Figure 1 we see that 
the smoothed F-measure monotonically increases in both cases. Also as expected, the true 
F-measure exhibits a non-smooth behavior expected of a discrete function, and also, the 
values of true and smoothed F- measures are not the same. The difference arises because the 
smoothed F-measure is based on probabilistic scores, which can take any value between 0.5 
and 1 depending on the problem (even when correct classification occurs). The important 
point to observe is that, in general, there is an increasing trend in true F-measure value as 
the optimization progresses. In the case of Car 3 dataset (left panel) we see that clearly. A 
similar trend is seen in the case of Yeast7 dataset (right panel) also, except for a small dip 
at the 10th iteration. Though such a behavior happens sometimes in early iterations, we 
observed that better true F-measure value is almost always obtained as the optimization 
progresses. 



Computational and Storage Complexities 

The computational complexity of the EP-CV algorithm depends on the number of ARD 
kernel parameters D. See appendix for more details. For a given problem with D fixed, the 
complexity is 0(n 3 ). This complexity is same as that of the EP(ML) method (see equation 
([5])) and LS-CV(NLP) method given in the next section. Also in many practical problems 
a single global scaling hyperparameter for all the input features is sufficient. Finally, the 
storage complexities of all the methods are 0(n 2 ). 



5. Other LOO-CV based Methods 



Having discussed our approach in detail it is useful to recall and discuss other LOO-CV 
based GPC model selection methods in relation to it. 

Opper and Winther derived a mean field algorithm for binary classification with 



GPs based on the TAP approach originally proposed in the statistical physics of disordered 
systems. They showed that this approach yields an approximate LOO estimator for the gen- 
erali zation error. This estimate is equivalent to the LOO-CV error est imate obtained from 
EP (|Minkal . l200lh . Instead of optimizing over the hyperparameters, lOpper and Winther! 
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(j2000h used the LOO-CV error count (using indicator functions) to choose rough hyperpa- 



rameter values by scanning a range of values. 



Qi et al.l (|2004l ) used the LOO-CV error estimate obtained from EP to determine ARD 
hyperparameters. They worked with the Gaussian process classifier where each input feature 
is associated with a weight parameter V{ with the prior J\f(0, Ci -1 )- The hyperparameters 
£~ wer e obtained by maximizin g the evidence using a fast sequential update based on the 



work of lFaul and Tipping! (|2002l ). The outcome of this optimization is that many £jS would 
go to infinity such that only a few nonzero weights Vi will be present. Even though the 
ARD hyperpar ameters were optimized by maximizing the evidence, to prevent overfitting 



Qi et al l (j2004h proposed to select the final model as the one that gives the minimum LOO- 



CV error count or probability. As the LOO-CV error count is discrete, they chose the model 
with maximum evidence when there is a tie in the count. Compared to this approach, we 
work with the GP classifier model (without the weight parameters) detailed in Section 2 
and optimize over the hyperparameters (including ARD) directly with various LOO-CV 
based measures (including F-measure, WER etc.) using gradient information. 

In this context, the LOO-CV based probabilistic LS classifier ( Rasmussen and Williams] . 



2006) is a more direct LOO-CV based GPC model selection approach. For the sake of 



completion we give some details here and later compare our algorithms with this approach 
in our experiments. This approach treats classification as a regression problem. Note that 
the probabilistic interpretation of LS criterion implies a Gaussian noise model. But the 
output y can take only +1 or —1 which is slightly odd. However, this approach is simple 
to implement and a probabilistic interpretation is given by passing the predictions through 
a sigmoid. 

Specifically, the LOO mean fj,^ and variance <t^, i 6 / are obtained from LOO-CV 
formulation of GP regression and the predictive distributions are obtained via ()13j) . Here, 
fi\i and <7^ are given by: 

H\i = Vi - 5i<7^ <7^ = g- (14) 



where a = Ky and K = (K + AI) _1 . Here A can either be set to a small positive value 
or treated as a regularization hyperparameter with a small upper bound constraint. This is 
useful when K can become ill-conditioned during optimization. Finally, the hyperparame- 
ters are optimized using (jSJ). We call this method as LS-CV(NLP). 



6. Experiments 

We conducted two experiments with various methods. See Table 2 for a summary of the 
various methods. In the first experiment we compared the performance of EP-CV(NLP) 
method with that of EP(ML) and LS-CV(NLP) methods. In the second experiment we 
compared the performance of EP-CV(FM) method with that of EP-CV(NLP) and two 
step classifier methods. We used the minimize Matlab routine of the GPML Matlab code 
available at http://www.gaussianprocess.org/gpml/code/matlab/doc/ for hyperparameters 
optimization. In all the experiments we used a single global scaling hyperparameter. 



12 



Predictive approaches for Gaussian process classifier model selection 



Table 2: Various methods and their descriptions. All the EP-CV methods are optimized 
using EP-CV algorithm. 



Method 


Description 


EP(ML) 


Marginal likelihood maximization within EP approximation. 
That is, optimize fl5J over 0. 


EP-CV(NLP) 


Negative Logarithmic Predictive loss minimization within EP approximation. 
That is, optimize ([6]) over 9. For ease of notation, this 
method is referred as nlp in table 7. 


LS-CV(NLP) 


Negative Logarithmic Predictive loss minimization within Least Squares 
approximation as described in Section 5. Optimize ([6]) over 6. 


EP-CV(FM) 


F-Measure maximization within EP-CV APPROXIMATION. 
That is, optimize (fTUj) over 9. For ease of notation, this 

METHOD IS REFERRED AS FM IN TABLE 7. 


NLP-FM(BIAS) 


IN THE FIRST STEP, HYPERPARAMETERS ARE OPTIMIZED USING EP-CV(NLP) METHOD. 
In THE SECOND STEP, ONLY THE BIAS PARAMETER (7) IS OPTIMIZED USING 

EP-CV(FM) method. We also refer this method as two-step method. 



Table 3: Data sets description: NLP Experiment. Here, n, D, p and nr represent the num- 
bers of training examples, input dimension, test examples and train/test partitions 
respectively. 



Dataset 


n 


D 


P 


nr 


Banana 


400 


2 


4900 


100 


Breastcancer 


200 


9 


77 


100 


Diabetes 


468 


8 


300 


100 


German 


700 


20 


300 


100 


Heart 


170 


13 


100 


100 


Image 


1300 


18 


1010 


20 


Ringnorm 


400 


20 


7000 


100 


Splice 


1000 


60 


2175 


20 


Thyroid 


140 


5 


75 


100 


Titanic 


150 


3 


2051 


100 


Twonorm 


400 


20 


7000 


100 


Waveform 


400 


21 


4600 


100 
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Table 4: NLP Performance 



Dataset/ Method 


EP(ML) 


EP-CV(NLP) 


LS-CV(NLP) 


Banana 


23.90 ± 0.81 


24.26 ± 1.06 


33.88 ± 1.89 


Breastcancer 


53.57 ± 4.75 


54.12 ± 5.27 


55.58 ± 5.03 


Diabetes 


47.74 ± 1.96 


47.97 ± 2.09 


50.72 ± 2.13 


German 


48.67 ± 2.74 


49.05 ± 2.76 


50.51 ± 2.30 


Heart 


40.16 ± 5.36 


40.03 ± 5.00 


45.11 ± 4.91 


Image 


8.26 ± 1.07 


8.45 ± 0.97 


22.70 ± 0.66 


Ringnorm 


16.88 ±0.93 


16.56 ± 1.01 


28.48 ± 0.75 


Solar 


57.25 ± 1.38 


57.35 ± 1.42 


59.61 ± 1.34 


Splice 


28.48 ± 0.88 


29.60 ± 0.79 


36.83 ± 0.42 


Thyroid 


10.21 ± 3.76 


9.94 ± 3.69 


25.33 ± 4.86 


Titanic 


66.86 ± 1.97 


51.73 ± 1.73 


53.78 ± 14.08 


Twonorm 


8.31 ± 0.88 


9.08 ± 1.97 


25.94 ± 0.53 


Waveform 


23.01 ± 0.89 


22.97 ± 0.67 


32.63 ± 0.59 



Table 5: Test Set Error Performance 



Dataset /Method 


EP(ML) 


EP-CV(NLP) 


LS-CV(NLP) 


Banana 


10.41 ± 0.65 


10.51 ± 0.50 


10.93 ± 0.67 


Breastcancer 


26.52 ± 4.89 


26.61 ± 4.80 


25.94 ± 4.59 


Diabetes 


23.28 ± 1.82 


23.41 ± 1.82 


24.30 ± 2.51 


German 


23.36 ± 2.11 


23.48 ± 2.00 


23.94 ± 2.33 


Heart 


16.65 ± 2.87 


16.62 ± 3.08 


17.91 ± 4.21 


Image 


2.82 ± 0.54 


2.77 ± 0.51 


2.74 ± 0.65 


Ringnorm 


4.41 ± 0.64 


4.29 ± 0.69 


5.05 ± 0.99 


Solar 


34.20 ± 1.75 


34.27 ± 1.80 


35.03 ± 1.89 


Splice 


11.61 ± 0.81 


11.85 ± 0.83 


11.83 ± 0.80 


Thyroid 


4.37 ± 2.19 


4.20 ± 2.17 


6.97 ± 3.78 


Titanic 


22.64 ± 1.34 


22.50 ± 0.98 


22.99 ± 2.81 


Twonorm 


3.05 ± 0.34 


3.19 ± 0.51 


3.43 ± 0.43 


Waveform 


10.10 ± 0.48 


9.95 ± 0.48 


11.70 ± 0.88 
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Table 6: Data sets description: F-Measure Experiment. Here, n, D, p, nr and PPE repre- 
sent the numbers of training examples, input dimension, test examples, train/test 
partitions and approximate percentage of positive examples respectively. 



Dataset 


n 


D 


P 


nr 


PPE 


Yeast7 


297 


8 


1187 


50 


2 


Yeast5 


320 


8 


1164 


50 


4 


Car3 


350 


6 


1378 


50 


4 


Ecoli5 


124 


7 


212 


50 


6 


Yeast4 


165 


8 


1319 


50 


11 


Breastcancer 


200 


9 


77 


100 


29 


German 


700 


20 


300 


100 


30 


Diabetes 


468 


8 


300 


100 


35 



Table 7: F-Measure Performance 



Dataset /Method 


NLP 


FM 


NLP-FM(bias) 


YEAST7 


32.24 ± 15.54 


42.58 ± 7.84 


40.85 ± 8.59 


YEAST5 


19.79 ± 12.85 


32.85 ± 8.56 


28.45 ± 10.57 


car3 


55.89 ± 9.30 


64.35 ± 8.51 


62.67 ± 8.49 


ECOLI5 


84.41 ± 6.25 


83.79 ± 5.90 


84.08 ± 5.86 


YEAST4 


71.35 ± 3.95 


73.64 ± 2.97 


73.23 ± 2.81 


BreastCancer 


38.42 ± 10.55 


47.34 ± 6.44 


46.98 ± 6.37 


German 


54.15 ± 4.17 


57.53 ± 2.95 


56.91 ± 2.87 


Diabetes 


62.69 ± 3.49 


66.23 ± 2.87 


66.12 ± 2.67 
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6.1 NLP Experiment 

In this experiment we used the thirteen benchmark datasets available in the wetH summa- 
rized in Table 3. Let us first consider the results from the first experiment given in Table 
4 and Table 5. For the EP(ML) method we used the GPML Matlab code available in the 
wetfl 

We conducted Friedman test ( Demsai . 20061 ) with the corresponding post-hoc tests for 



comparison of classifiers over multiple datasets. The compari son over multip le datasets 
requires a performance score of each method on each dataset ( Demsai . 20061 ). Here, we 



consider the mean over the partitions of a given dataset as the performance score. As 



pointed out in iDemsart (J2006J) , it is not clear how to make use of the standard deviation 



information when the datasets are not independent over the partitions. The Friedman test 
ranks the methods for each dataset separately based on the chosen performance score (mean 
performance in our case). The best performing method gets the rank of 1, the second best 
rank 2 and so on. In case of ties, average ranks are assigned. The Friedman test checks 
whether the measured average ranks (over the datasets) are significantly different from 
the mean rank under the null hypothesis. Under the null hypothesis all the methods are 
equivalent and so their ranks should be equal. 

In the case of NLP performance measure, the measured average ranks for the EP(ML), 
EP-CV(NLP) and LS-CV(NLP) methods were 1.46, 1.62 and 2.92 respectively. With three 
methods and 13 datasets, the F-statistic comparison at a significance level of 0.05 rejected 
the null hypothesis. Since the null hypothesis was rejected we conducted the Nemenyi post- 
hoc test for pairwise comparisons. This test revealed that the results of the EP(ML) and 
EP-CV(NLP) methods are better than the LS-CV(NLP) method at the significance level of 
0.05. On the other hand, the post-hoc test did not detect any significant difference in the 
results of EP(ML) and EP-CV(NLP) methods. In the case of test set performance measure, 
the measure averaged ranks for the EP(ML), EP-CV(NLP) and LS-CV(NLP) methods were 
1.62, 1.77 and 2.62 respectively. Note that the average rank of LS-CV method has improved 
on the test set error performance. Here again, the null hypothesis is rejected at the same 
significance level and the post-hoc test did not detect any significant difference in the results 
of EP(ML) and EP-CV(NLP) methods. The results of the EP(ML) and EP-CV(NLP) 
methods are better than the LS-CV(NLP) method at the significance level of 0.05 and 0.1 
respectively. Thus, we can conclude that EP(ML) and EP-CV(NLP) are competitive to 
each other. Further, both these methods perform better than the LS-CV(NLP) method in 
this experiment. 

We can also make other observations from the tables. We note that the NLP perfor- 
mance of the LS-CV(NLP) method is quite inferior on several datasets even though its 
test set error performances on most of these datasets (except waveform and thyroid) are 
relatively closer. Further, some kind of group behavior can be seen. For example, the 
NLP scores are high on titanic, breast- cancer, diabetes, German and flare-solar. Also, the 
test set errors are > 20% on these datasets. Consequently, we may consider these datasets 
as difficult ones. From Table 4 we observe that the NLP performance of LS-CV(NLP) 
is closer to the other two methods on these datasets (compared to its performance on 



3. http://ida.first.fraunhofer.de/projects/bench/benchmarks.htm 

4. http://www.gaussianprocess.org/gpml/code/matlab/doc/ 
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other datasets). Next we can order the remaining datasets heart, splice, banana, waveform, 
ringnorm, thyroid, twonorm and image in terms of descending difficulty. Note that the dif- 
ference in the NLP performance seems to have an increasing trend as the dataset becomes 
easier. To understand this we looked at the predictive probabilities of these methods on 
both correctly and wrongly classified examples. In the case of thyroid dataset these average 
probability scores for the LS-CV(NLP) method were 0.84 (for correct classification) and 
0.35 (wrong classification). Here, the averaging was done over all the correctly(wrongly) 
classified examples over all the partitions. On the other hand, the corresponding values 
for the EP-CV(NLP) were (0.96,0.39). In the case of banana dataset, these scores were 
(0.79,0.35) and (0.92,0.29) for the LS-CV(NLP) and EP-CV(NLP) methods respectively. 
In the case of German dataset, they were (0.75, 0.33) and (0.79, 0.32). The scores for the EP- 
ML method were very close to that of the EP-CV(NLP) method. In general, we observed 
that the predictive probability estimates from the LS-CV(NLP) method were relatively poor 
and resulted in poor NLP performance. We looked at the hyperparameter estimates of the 
different methods and observed that the LS-CV(NLP) method takes smaller width and sig- 
nal variance hyperparameter values on most of the datasets except on the difficult datasets 
(mentioned above) compared to the other two methods. Apart from this we did not observe 
any specific pattern in the hyperparameter values chosen by these methods. Looking at the 
hyperparameter estimates of EP(ML) and EP-CV(NLP), it seems that several solutions in 
the space of hyperparameters that give close performances are possible. 



6.2 F-measure Experiment 

The datasets used in this experiment are described in Table 6. The datasets yeast, car and 
ecoli are multi-class datasets and we converted them into binary classification datasets by 
considering examples belonging to the class label indicated by the number (for example, 7 
in yeast7) as positive class respectively and treating the rest of the examples as negative 
class. These datasets are available in the webU- We created 50 partitions for these datasets 
in a stratified manner reflecting the class distributions. 

Let us consider the results from the second experiment given in Table 7. In this ex- 
periment all the results were obtained within the EP-CV framework. The first and second 
columns represent results obtained using NLP and smoothed F-measure (i.e., eq. (|10p ) as 
the optimization criterion respectively. The third column represents results obtained from 
the two step classifier described earlier. We looked at the hyperparameter estimates of 
the different methods. We observed that the bias estimates of the two step classifier were 
somewhat closer (within 10%) to those of the smoothed F-measure method on the breast- 
cancer, diabetes and German datasets. On the remaining datasets they were different by 
more than 40%. The width and signal variance hyperparameter estimates were also quite 
different. From Table 7, we observe that the two step method is also good and gives closer 
performance to the smoothed F-measure method on several datasets. Further analysis of 
the performance results revealed that even though the standard deviations are high, the 
smoothed F-measure method gave better performance than the two step method on major- 
ity of the partitions on several datasets. We believe that the larger standard deviations in 
the results arises from the sensitivity to the dataset with lesser number of positive examples. 



5. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/ 
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To carry out the statistical significance tests, again we used the mean (over the partitions) 
as the performance score for each of the methods. The measured average ranks for these 
three methods were 2.75, 1.25 and 2.00 respectively. With three methods and 8 datasets, 
the F-statistic comparison at a significance level of 0.05 rejected the null hypothesis. The 
Nemenyi post-hoc test for pairwise comparisons revealed that only the results of smoothed 
F-measure is better than the NLP based method at the significance level of 0.05. In this 
experiment the post-hoc test did not detect any significant differences in the comparisons 
of smoothed F-measure method with the two step method and the two step method with 
the NLP method. However in these two comparisons the rank differences were closer to 
the required critical differences at the significance level of 0.1. We also observed that if 
we were to conduct Wilcoxon signed-rank test on these methods (as if we were comparing 
only two classifiers) then the results were statistically significant at the significance level of 
0.05 for all the three pairs. In summary, the results demonstrate the usefulness of direct 
optimization of smoothed F-measure. 



7. Conclusion 

In this paper, we considered the problem of Gaussian process classifier model selection with 
different LOO-CV based optimization criteria and provided a practical algorithm using LOO 
predictive distributions with criteria like standard NLP, smoothed F-measure and WER to 
select hyperparameters. More specifically, apart from optimization of standard NLP, we 
demonstrated its usefulness in direct optimization of smoothed F-measure, which is useful to 
handle imbalanced data. We considered predictive distribution arrived from the Expectation 
Propagation (EP) approximation. We derived relevant expressions and proposed a very 
useful EP-CV algorithm. The experimental results on several real world benchmark datasets 
showed comparable NLP generalization performance (with NLP optimization) with existing 
approaches. We demonstrated that the smoothed F-measure optimization method is a very 
useful method that improves the F-measure performance significantly. Overall, the EP-CV 
algorithm is an excellent choice for GP classifier model selection with different LOO-CV 
based optimization criteria. 
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Appendix: Optimization criteria, approximate predictive distributions 
and their derivatives 



From the definitions of various optimization criteria like NLP measure ([6]) and (jlOp etc, we 
see that the chosen measure and its derivatives can be obtained from the LOO predictive 
distributions p(yi\x i: S\ i: 6) and their derivatives dpfal^A"^) _ jjere, Q. j s j th component of 
the hyperparameter vector 6. Note that the derivative of the NLP measure © is given by: 

dG(0) lA 1 dpjyilxu^e) 
and the derivative of the smoothed F-measure ()10p is given by: 

ojm = - aw - o*$& 

89 j rj 2 (0) 

where rj(6) = £n + + (1 — £)m + (0). Note that the derivatives dJ ^^ and drri QQ^ are 

dp(yi=+i\xi,S\i,9) 



(16) 



directly dependent on — de " x " — . Now, from (|T3I) we see that to define the LOO 

predictive distributions, we need the LOO mean fj,^ and variance a^-. In the case of EP 
approximation, analytical expressions to compute these quantities are already available (see 
eqn. (j3J)). Next, we give details on how the derivatives of predictive distributions can be 
obtained with these approximations. 

Derivatives of Predictive Distributions 

For ease of reference we recall (1131) here. 



p(yi\xi,S\i,0) = $ , - 

1 + °\i 

Then, with z; = , „ and Nlzi) = -7==exp(— ^f) we have 



dp(yi\xi,S\i,0) _ N( Zi )yi / d^y ^ 1 Zj da \i 
Here, 6j represents any element of other than 7. Similarly, we have 



dp(yi\Xi,S\i,e) N{zi)yi 



91 0^4 

du\ da?. 

Thus, we need -q^- and -g^-- Below, we give details on how they can be obtained with the 
EP approximation. 



20 



Predictive approaches for Gaussian process classifier model selection 



Derivatives of LOO mean and variance with fixed site parameters: EP 
Approximation 



In the case of EP approximation, since we resort to EM type optimization, we derive 
xpressions for an 
2 / m _ 4) we h a ve 

\* v Ca erf ' ' 



expressions for -q^ 1 and -g^- assuming that the site parameters are fixed. From /x\ 



d^\i M\< da \i % („ dvcii dCi 



dOj a\ dOj (Cu) 2 V dOj d9j 
From 0-2. = ((C^)" 1 - err 2 )- 1 , we have 

Since m = CS _1 /i, we have = Note that C can be re-written using Sherman- 

Morrison- Woodbury formula as: C = K — K(K + S) -1 K and it is useful to work w ith 
this expression to achieve improved numerical stability (jRasmussen and Williams! . l200d ) as 
it avoids inversion of K. Then we have 

wr {I - {K+srlK)T ^ (I - {K+n 

Note that ^0-, i € I (where / = {1, 2, . . . , n}) are nothing but the diagonal entries of the 
above expression. Note also that ^ can be efficiently computed by taking advantage of 
the presence of the vector S -1 //. But, to compute ^j^S i £ I we cannot avoid the matrix 
multiplication with this results in 0(n 3 ) for each 9j. Finally, it is useful to re- write 
(K + S)- 1 = S~3(I + S"5KS"5)^s4 dRasmussen and Williams! . l2006h . 
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